VASP on a GPU: application to exact-exchange calculations of the stability of 

elemental boron 
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General purpose graphical processing units (CPU's) offer high processing speeds for certain classes 
of highly parallelizable computations, such as matrix operations and Fourier transforms, that lie at 
the heart of first-principles electronic structure calculations. Inclusion of exact-exchange increases 
the cost of density functional theory by orders of magnitude, motivating the use of CPU's. Porting 
the widely used electronic density functional code VASP to run on a GPU results in a 5-20 fold 
performance boost of exact-exchange compared with a traditional CPU. We analyze performance 
bottlenecks and discuss classes of problems that will benefit from the GPU. As an illustration of the 
capabilities of this implementation, we calculate the lattice stability a- and /?-rhombohedral boron 
structures utilizing exact-exchange. Our results confirm the energetic preference for symmetry- 
breaking partial occupation of the /3-rhombohedral structure at low temperatures, but does not 
resolve the stability of a relative to /?. 

PACS numbers: 61.50.Lt,61.43.Dq, 71.20.Be, 81.30.Bx 
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I. INTRODUCTION 

First principles quantum mechanical calculations of to- 
tal energy are among the most pervasive and demand- 
ing applications of supercomputers. The problem is an 
interacting many-body problem whose wavefunction de- 
pends on the coordinates of Ne electrons. The computa- 
tional cornplexity grows exponentially with the number of 
electrons [l| , thus severely hmiting the number of atoms 
that can be treated. Elemental boron, for example, with 
its highly complex crystal structure containing approx- 
imately 107 atoms, hes well beyond the limits of exact 
energy calculation. 

Replacing the many-body problem with Nf, coupled 
single electron problems reduces the exponential depen- 
dence to a polynomial, at the cost of introducing ap- 
proximations. For example, both Hartree-Fock (HF) 
and electronic density functional theory (DFT) are for- 
mally O(iVg) in complexity 0]. However, further ap- 
proximations can additionally reduce the dependence on 
Ne- Indeed, 0{Ne) methods are possible in principle 
provided the problem is sufficiently local Q. In practice, 
for the physics problems to be discussed below (utiliz- 
ing the plane- wave-based code VASP), the complexity of 
HF-type hybrid functionals [J] is ©(TV^ logTVe) and for 
DFT g it is 0{Nl\ogNe). The benefit of HF-type cal- 
culation is the exact treatment of electron exchange in- 
teractions, while the net cost amounts to orders of mag- 
nitude in run time. 

While the run time could possibly be reduced by run- 
ning on a faster computer, actual frequencies of computer 
chips have held rather constant in recent years. Instead 
of running at higher frequencies, the trend has been to 
increase computer power through parallelization, using 
multi-core chips, multi-chip nodes, and multi-node com- 
puter systems. Recently massively parallel processing 



became available for low-cost computer systems through 
the introduction of general purpose graphical process- 
ing units (GPU). These systems can contain hundreds 
of cores, with a low cost and low power consumption 
per core. It is thus of high interest to evaluate the suit- 
ability of GPU systems for practical electronic structure 
calculations. The fast Fourier transformation (which is 
an important VASP bottleneck) has been ported to the 
GPU g . Here we address the hybrid HF-DFT function- 
als that include exact-exchange. 

Below, we first first summarize the physics problem to 
be addressed, then describe our implementation of VASP 
on a GPU system, and finally we apply it to study the 
energies of competing boron structures. Our key results 
are: 1) exact-exchange calculations confirm that the sym- 
metric /? structure is energetically unfavorable relative 
to a, but its energy can be reduced through symmetry 
breaking partial site occupancy. 2) The GPU system out- 
performs the CPU, with speedups reaching a factor of 20 
in computationally demanding cases. 



II. ELEMENTAL BORON 

Boron is of practical interest because of its light weight, 
high strength and semiconducting properties. It is also 
intrinsically interesting owing to its many competing 
complex structures, notably the a- and /3-rhombohedral 
structures characterized by differing arrangements of 
icosahedral clusters, as illustrated in Fig. [T] 

We implemented exact-exchange calculations within 
VASP on a GPU system and applied the method to cal- 
culate total energies of elemental boron structures. Two 
factors motivate this decision. 1) Competing structures 
of elemental boron are sufficiently close in energy that 
approximations within DFT might affect the identifica- 



FIG. 1: (color online) (left) Structure of a-rhombohedral Boron, (right) Structure of /3-rhombohedraI Boron. Fully occupied 
sites shown in green. Fully occupied B15 site at cell center in red. Partially occupied sites are: B13 (75%) blue; B15 100%; 
B16 (27%) yellow; B17 (9%) yellow; B18 ( 7%) magenta; B19 (7%) maroon; B20 (4%) orange. 



tion of the true low temperature state. Hence HF-type 
calculations including exact- exchange are warranted. 2) 
The structures are quite complex with the true ground 
state possibly containing 107 or more atoms per unit cell, 
usually with low symmetry. Hence the HF calculations 
will be highly demanding and the GPU system may be 
beneficial. 



A. Structural stability 

Determining the stable structure of elemental boron 
remains problematic after more than a half century of ef- 
fort. A substantial number of polymorphs are known 0, 
including at least two rhombohedral and two tetragonal 
forms as well as an orthorhombic high pressure form 8] . 
All polymorphs are built around a common motif of 12 
atom icosahedral clusters. Many of the polymorphs ap- 
pear to be kinetically trapped, or stabilized by impurities. 

Our goal is to determine the minimum enthalpy struc- 
ture in the limit of low temperature and pressure. The 
two relevant structures are the a-rhombohedral struc- 
ture [01 (Pearson type hR12, with 12 atomic sites per 
primitive cell) , which was initially believed [13] to be the 
stable form stable at low temperatures below T=1000K, 
and the more complex /3-rhombohedral structure [TT| - [l3j 
(Pearson type hR105, with 105 atomic sites per prim- 



itive cell) which was initially believed to be stable at 
high temperatures, from T=1500K up to melting. Al- 
though uncertainty remains concerning the role of impu- 
rities 14], current opinion suggests a is only metastable, 
and (3 is the true equilibrium state for all temperatures, 
from melting down to T=OK Q- 

Total energy calculations initially challenged the sta- 
bility of the /3 phase. Early calculations on isolated clus- 
ters using a molecular orbital model 15] gave hints that a 
might be lower in energy than l3. This was later demon- 
strated for bulk crystalline structures (specifically, the 
Q;-hR12 and /3-hR105 structures) using DFT 0] and 
has been since reconfirmed |T7Hl9| . The energy differ- 
ence between structures is quite small, and not well re- 
solved within DFT, ranging from A_E/3ct=48 meV/atom 
in the local density approximation (LDA) to A£^/3c<=26 
meV/atom in the PW91 [20| generalized gradient approx- 
imation (GGA). 

A possible explanation lies in inaccuracies of the as- 
sumed hR105 structure. In fact, experimentally observed 
density is consistent with 106 or 107 atoms in the rhom- 
bohedral cell. The most recent and high quality re- 
finement [2l| reports 141 atomic sites per rhombohedral 
cell, many with partial occupancy. The optimal arrange- 
ment identified using the PW91 GGA hj now places 
107 atoms among these sites and reaches Ai?^/c=-0.86 
meV/atom. That is, it predicts the partially occupied 
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/?' structure is lower in energy than a. Experimental 
evidence exists for structural anomalies at low temper- 
atures based on photoabsorptio n (22l . [23j . internal fric- 
tion [23, and heat capacity [265. These phenomena 
could be consistent with the predicted low temperature 
symmetry-breaking phase transition [l^, . 

However, the energy of the PW91-optimized structure 
still exceeds the energy of a by A/3/c = 15 meV/atom us- 
ing LDA. Given the very small energy differences (per 
atom) among competing structures, it is not certain that 
DFT is of sufficient accuracy to correctly address the rel- 
ative stabilities. In fact, the situation is reminiscient of 
a frustrated system [11]. To further test the reliability 
of our energy calculations requires climbing the "ladder 
of density functionals" [2^. The proposed sequence is 
LDA, GGA, meta-GGA, hybrid. The GGA differs from 
LDA through the inclusion of electron density gradient 
terms. Meta GGA then adds second derivatives. Un- 
fortunately, within the VASP code only the PKZB ^ 
meta-GGA is presently available, and that is included 
in a non-self-consistent fashion |3l[. Hybrid functionals 
include exact-exchange. We choose the HSE06 [H] func- 
tional as the best currently available. 



B. Setup of DFT calculations 

Our study focuses on four specific structures of inter- 
est: the 12-atom a-rhombohedral structure with Pearson 
type hR12; a 96-atom supercell of a, doubled along each 
axis in order to match the lattice parameters of /3, that 
we denote hR12x8; the ideal 105-atom /3-rhombohedral 
structure with Pearson type hR105; and the symmetry- 
broken 107-atom 13' variant that optimizes the GGA total 
energy that has Pearson type aP107. 

Because of the approximate doubling of lattice con- 
stants between a and P, we double the linear density of 
/c-point meshes for hR12 runs compared with the other 
structures. Because of the differing symmetries of the 
structures the numbers of independent fc-points (NKPT) 
differ among structures in addition. Figures for a typical 
mesh density are given below in Table HI Run times are 
linear in NKPT for conventional DFT and quadratic for 
HF-DFT. Also important are the grids used to represent 
charge densities and wavefunctions in real space, with the 
number of grid points in a line (NGX) proportional to the 
corresponding lattice parameter. The time complexity of 
FFTs grow with an A'^ log A'' factor for each spatial dimen- 
sion, or log N overall. The values quoted correspond 
to the VASP setting "PREC=Normal" , in which FFT's 
are performed without wrap-around errors. 

We utilize the Projector Augmented Wave (PAW) po- 
tentials and adopt the PBE 33}, generalized gradient ap- 
proximation for our standard exchange correlation func- 
tional. PBE is the basis on which the HF-type hybrid 
functional known as HSE06 is built (s^]. Wavefunctions 
are represented in reciprocal space with a plane wave 
energy cutoff of 319 eV (the default setting). For future 



Name 


fc-mesh 


NKPT NGX 


hR12 


4x4x4 


10 24 


hR12x8 


2x2x2 


2 48 


hR105 


2x2x2 


2 48 


aP107 


2x2x2 


4 48 



TABLE I: Parameters for typical runs. fc-point meshes 
are Monkhorst-Pack. NKPT is the number of symmetry- 
independent fc-points, and NGX the linear dimension of the 
FFT grid. 





PBE 


HP 


fc-mesh 






Afia A^/q 


1x1x1 


-6.87 


-16.45 


+10.92 -7.67 


2x2x2 


+24.39 


-1.06 


+44.25 +7.11 


3x3x3 


+26.63 


-0.17 


+46.74 +8.06 


4x4x4 


+26.07 


-0.20 





TABLE 11: Table of fc-mesh convergence, fc-point meshes 
are Monkhorst-Pack. Units are meV/atom. Energies labeled 
hR12 are for the super cell hR12x8. 

reference we show the convergence of results with respect 
to mesh density in Table |lTl Evidently a 3x3x3 mesh size 
will be adequate for resolving energy differences at the 
meV/atom level that is needed for comparison of struc- 
tural energies. However, for computational efficiency, our 
benchmarking will focus on the 2x2x2 meshes discussed 
in Table HI as they demonstrate non-trivial meshes, but 
also small enough to be easily studied. 

III. PORTING OF VASP 

VASP is widely used for DFT quantum chemistry cal- 
culations. It is a large Fortran code, with contributions 
dating back to the 1980's, and portions of the code writ- 
ten in English, French and German. VASP's performance 
has been discussed in the past [1, Q , with the most rele- 
vant discussion by Eck et al Q . The exact-exchange cal- 
culations involve less-commonly used functionality based 
on principles from HE theory. However, this function- 
ality consists of numerical operations similar to those 
used in more conventional calculations, so the perfor- 
mance enhancements obtained should transfer to more 
conventional portions of the code. 

A. CPU Performance Analysis 

Before considering performance analysis, we define the 
test platform and test cases. This study was performed 
on tirith (tr), a midrange workstation, and blacklight 
(bl), a SGI UV 1000 supercomputer at the Pittsburgh 
Supercomputing Center [34j]. Tirith is equipped with an 
Intel Core 17 920, which is a Nehalem-based quad-core 
clocked at 2.67 GHz with 8MB of cache, 12GiB of DDR3 
at 1333 MHz, a Tesla C2075 GPU, and a Tesla C2050 
GPU. Tirith's current value is around $7,000. Tirith runs 
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CompoiiBnt 


hR12 


hR12x8 


hR105 


aP107 


FOCK_ACC 


0.6208 


8.6188 


10.3627 


21.2870 


FOCK_FORCE 


1.2776 


16.7303 


19.9418 


41.0646 


Other 


0.0235 


0.1883 


0.2087 


0.3622 


Overall 


53.0 


730.2 


877.1 


1,802.2 



TABLE III: Computational cost, measured in TFLOP, of mu- 
tually exclusive sections of VASP. Tests are specified in Ta- 
ble |I] and run with a single electronic and ionic minimization 
step. Overall costs are projected assuming a total of 5 ionic 
minimization steps and 75 electronic minimization steps. 



Intel Composer XE 2011, providing compilers and BLAS 
through MKL, FFTW version 3.2.2, and the 4.0 version 
of CUDA toolkit, which includes cuBLAS, cuFFT, and 
compilers. Blacklight is a 256 blade system, each blade 
housing two Intel Xeon X7560s, which are Nehalem-based 
8-core chips clocked at 2.27 GHz with 24 MiB cache, 
and 128 GiB DDR3 at 1066 MHz. Blacklight runs In- 
tel compilers and BLAS version 11.1 and FFTW version 
3.2.2. The Integrated Performance Monitor (IPM) [H 
runs on both systems and provides timing and hardware 
counter information which can be broken down across 
user-defined sections of the code. This allowed for the 
measure of floating-point operations. 

Using the boron test structures in Table HI the effi- 
ciency of VASP on the CPU can be characterized. The 
LINPACK benchmark [s^ provides an estimate of the 
peak realizable performance: 11.19 GFLOPs on a sin- 
gle CPU core. Floating-point performance of VASP is 
found by dividing the workloads given in units of float- 
ing point operations (FLOP, or TFLOP using standard 
power-of-10 SI prefixes) in Table IIIII by the run-times 
given in units of seconds (s) in llVi yielding units of 
FLOPS (FLOP/s). In the test cases above, VASP runs 
at only 1.0-1.75 GFLOPs, or about 9-16% utilization. 

The exact-exchange functionality is primarily con- 
tained in two routines: FOCK.ACC and FOCK_FORCE. 
FOCK_ACC applies the Fock exchange operator, while 
FOCK_FORCE calculates the Hellman-Feynman forces 
acting on the ions. The two routines are nearly identical, 
with a literal copy of FOCK_ACC constituting the major- 
ity of the FOCK.FORCE routine. A set of FOCK^CC 
calls are made once per electronic minimization step, 
while a set of FOCK_FORCE calls occur once per ionic 
minimization step. 

IPM reveals the computational workload of these two 
routines. For our structures, they account for over 98% 
of fioating-point operations and over 97% of run-time, as 
seen in Table IIIII FOCK_ACC requires approximately 
half as much effort as FOCK_FORCE, per call, but is 
called much more frequently, making it the dominant rou- 
tine. 

Because FOCK_ACC and FOCK_FORCE constitute 
the overwhelming majority of each electronic and ionic 
minimization step, respectively, we can project the work- 
load and run-time of full-length runs from data collected 



in truncated test runs. The total time for a full run is 

^run — Sete + Siti + Other (1) 

where Se and Si are, respectively, the numbers of elec- 
tronic and ionic minimization steps, te and ti are the cor- 
responding time increments, and Other represents all the 
remaining portions of the code. This method is also used 
to project computational cost, where times are replaced 
by Floating-Point Operations in the formula above. Note 
that this method underestimates by neglecting the non- 
FOCK, per-minimization operations, which, along with 
the Other category, have been shown to be minimal. 

Within the FOCK routines, performance can be bro- 
ken down even further. On the CPU, BLAS calls com- 
prise 20-30% of the run-time and EFT calls comprise 
between 35-50%. The remaining 25-30% is distributed 
between data manipulation kernels and 'book-keeping'. 
This includes particular effort spent in scatter/gather op- 
erations used to express cutoffs in the projected regions. 
The heavy dependence on BLAS and EFT makes the 
VASP code a prime candidate for GPU acceleration. 

B. Our Implementation 

Our implementation aims to be a proof of concept for 
GPU acceleration of exact-exchange in VASP. We tried to 
make the implementation as simple and compartmental 
as possible, while preserving the abstraction and inter- 
faces inherent in VASP. At five points in the VASP code, 
our implementation intercepts the normal fiow of execu- 
tion. The first two are trivial: in the main routine we 
intercept at the beginning and end to create and destroy 
library contexts and global variables. The third inter- 
cepts the FFT3D routine that occur outside the FOCK 
routines, sending large EFTs to the GPU while leaving 
small ones on the CPU. The final two intercepts bypass 
the FOCK.ACC and FOCK_FORCE routines discussed 
previously. 

The optimal size at which to begin sending EFTs to 
the GPU can be computed once per system with a sim- 
ple benchmark that directly compares CPU and GPU 
run-time for EFTs. On our system, the optimal size was 
found to be 28"^. This decision does not apply to EFTs 
in the FOCK JlCC and EOCK_FORCE routines. Recent 
improvements in the cuFET library have reduced the im- 
portance of using power-of-two EFT grids compared to 
previous reports [6]. 

The two exact-exchange calculations consist of four 
nested loops, the outer two over two k-point indices and 
the inner two over band indices. The loops compute sev- 
eral quantities for every pair of bands in the structure. 
For example, the magnitude of the Fock exchange energy 
is 

e'^, f f ^3 -73 , '^kn(i-)'/'qm(rO'/'kn(rO0q„,(r) 
2 Z^/kn/qm j a vd V |r-r'| 

qm 

(2) 
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kJ Li J- m_< L Lll c 


hR12 




111 ljl_U<J 


aP107 


Platform 


cpu gpu 


Cpu gpu 


cpu gpu 


cpu gpu 


FOCK_FORCE (s) 
Other (s) 


/I no o p;o o 

789.1 290.1 
26.9 27.5 


0,Uyo.o ooi.o 

10,714.9 1,199.3 
117.8 134.6 


1 n /IflV O /I Q 
1U,4D(.Z 4o(.o 

22,144.5 1,435.5 
216.2 142.2 


iz,oDO.U yo4.z 
22,598.0 2,912.8 
206.6 246.3 


Overall (hr) 
Speedup 


9.64 1.66 
5.82x 


121.04 9.77 
12.39x 


248.88 12.20 
20.41x 


299.49 24.62 
12.17x 



TABLE IV: Run-times of components of VASP exact-exchange runs. Units are seconds (s) and hours (hr). Tests are specified 
m Table HI and run with a single electronic and ionic minimization step. Overall times are projected assuming a total of 5 ionic 
minimization steps and 75 electronic minimization steps. CPU runs are single-core and GPU runs are single- device. 



where the outer loops run over fc-points k and q and 
the inner loops run over bands m and n. We place 
the inner loops over bands on the GPU. The inner-most 
loop is multiplexed in a manner similar to that found in 
other parts of the CPU code and controlled by the NSIM 
parameter. This changes matrix-vector operations into 
more efficient matrix-matrix operations, increasing the 
data size and allowing for the application of thousands 
of GPU threads. It also allows the memory usage on 
the GPU, which is dependent on the number of bands 
processed concurrently, to be controlled at run-time. By 
moving non-intensive routines to the GPU (i.e. every op- 
eration that occurs within the two loops) , memory trans- 
fers between the host and device are minimized. The 
port operates within the existing parallelism in VASP, 
distributing data across processors with MPI. 

The GPU code relies heavily on the cuFFT and 
cuBLAS libraries for performing efficient EFT and BLAS 
operations, respectively. An additional 20 custom kernels 
were written to replicate the CPU's functionality. CUDA 
streams are used to define functional dependencies, al- 
lowing for asynchronous concurrent execution of smaller 
kernels, boosting performance on small input structures. 
Two numerical precision settings are available: full dou- 
ble precision and mixed precision, which evaluates some 
or all FFTs in single and everything else in double. The 
two precisions settings were found to agree within one 
thousandth of a percent, comparable to numerical differ- 
ences seen when running the same structure on different 
platforms. 



C. Performance Results 

There are a number of ways to compare CPU and GPU 
performance. The simplest is to look at run-times. Us- 
ing IPM, we are able to accurately measure the time 
of the FOCK.ACC and FOCK.FORCE routines, along 
with the overall time. For the boron structures in Table U 
and timings in Table HVl we see that the GPU performs 
at 7x-21x in the FOCK_ACC routine and at 3x-15x in 
the FOCK_FORCE routine. For large structures, a small 
speedup is also seen in the remaining, non-FOCK code, 
with the GPU running at up to 1.5x. Assuming the com- 
putational cost is the same for the CPU and GPU ver- 
sion, the GPU utilization can be calculated. NVIDIA's 



CUDA accelerated LINPACK running on a single GPU 
on our system performs at around 250 GELOPs. Our 
VASP implementation runs at around 9-26 GELOPs, or 
about 4-10% utilization. 

Another interesting comparison is the tirith worksta- 
tion vs. the blacklight supercomputer. We tested boron 
structures on the CPU and GPU of tirith, and 4 differ- 
ent CPU configurations on blacklight: 16, 32, 64, and 128 
core. As seen in Table IVl the gpu workstation performs 
comparably to 32-64 cores on the supercomputer for the 
structures of interest. 

For small structures, blacklight becomes saturated 
with excess communication at a relatively low core-count, 
a well-known limitation of the scaling of VASP. Tirith, 
however, behaves like a 'fat' node in that the CPU's com- 
pute power is greatly increased but the inter-process com- 
munication stays fixed. This leads the GPU to compare 
favorably to many-CPUs on small structures, as seen in 
the second row of Table IVl labeled hR105 k—2, where we 
see that 2 CPU cores plus 2 GPUs perform equivalently 
to the fastest time on the supercomputer, which occurs 
for 64 cores. 

There is, however, a limit to this effect. If the EFT 
grids are exceptionally small, then the single GPU per- 
formance is so poor that scaling effects don't make a dif- 
ference, as seen in the first row of Table |Vl 

On larger structures, the GPU efficiency dramatically 
increases, causing it to compare favorably to runs on a 
small number of CPUs. Large CPU runs benefit from 
the large system as well, though, making the compari- 
son between the GPU and a large number of CPUs less 
favorable. 

Note that the discrepancies between Table |V] and Ta- 
ble |lVl seen clearly in the hR105 case, are expected. Ta- 
ble |V] was constructed using truncated runs, increasing 
the relative impact of non-minimization overhead, which 
is negligible in the projections in Table IIVI 



IV. CONCLUSIONS 

Our exact- exchange calculations confirm that the fully 
occupied hR105 structure of /3-boron is energetically un- 
favorable with respect to a, and that symmetry-breaking 
assignments of partially occupied sites of the hR141 
structure substantially reduce the total energy per atom. 
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Structure 


k 




t 


hR12x8 


2 


hR105 


2 


hR105 


3 


aP107 


2 



T-ICOG T-ICIG T-2C2G 



98.3 
14,464.7 
15,530.5 
160,148.6 
32,178.2 



90.3 
1,650.7 
2,097.2 
20,489.9 
3,748.4 



64.8 
983.6 
1,075.2 
10,318.0 
2,168.4 



B-16C B-32C B-64C B-128C 



43.3 
1,964.8 
2,157.0 
21,080.4 
4,452.5 



47.8 60.5 

1.206.0 1,070.7 

1.201.1 1,039.7 
10,741.3 7,794.9 

2,515.4 1,900.9 



172.4 
1,160.3 
1,221.0 
5,817.5 
1,816.5 



TABLE V: Actual run-times of truncated runs, reduced NELM and NSW, of different structures on different platforms, 
tirith, B is blacklight, attributes mCnG indicates m CPU cores and n GPU devices. 



T is 



LDA PW91 PBE PKZB HE 



47.83 25.87 26.63 37.02 46.74 
15.48 -0.86 -0.17 8.53 8.06 



TABLE VI: Table of structural energies (units meV/atom). 
Here /3 refers to the ideal hR105 structure, /?' refers to the 
107 atom optimized variant of B.hR141. Energies of a are 
obtained from the super cell hR12x8. All values are given for 
the 3x3x3 fc-point mesh. 



However, our specific realization of atomic positions 
within hR141 that was optimized using the PW91 func- 
tional (GGA) proves unstable (see Table IVT| relative to 
a in our exact-exchange calculation. It appears that we 
need to re-optimize the structure of hR141 before we 
know the relative energies of a and /3 within the exact- 
exchange framework. 

The notion of a ladder of density functionals [1^ im- 
plies the existence of a sequence of calculational methods 
that converge towards an exact answer, presumably one 
that confirms experimental reality. Inspecting the series 
of energies in Table |VT] we see that indeed the values 
appear to converge. Surprisingly they converge towards 
values that are closer to the local density approximation 
than to the generalized gradient approximation. Still, the 
remaining variability among density functional suggests 
that an even higher level of theory (e.g. quantum Monte 
Carlo) might be needed to fully resolve the problem of 
the low temperature stable crystal form of boron. 

The GPU implementation of hybrid functionals in 
VASP outperformed one CPU core by about an order 
of magnitude, allowing the study of boron structures 
with exact-exchange. It would take 64 CPU cores on the 
blacklight supercomputing to achieve the same time-to- 
solution for the hR105 test case with two k-points. Even 
more remarkably, the minimal time-to-solution achiev- 
able on blacklight with any number of cores is only 
3% shorter than our 2 GPU system. This turns a 
supercomputing-level calculation into a tractable prob- 
lem that can be addressed on-demand on desktop hard- 
ware. 

Just as this speedup enables us to apply hybrid func- 
tionals to boron, it can enable other previously imprac- 



tical exact-exchange calculations such as structural tran- 
sitions in pnictide superconductors. The previously high 
cost of entry for this type of calculation, namely a large 
cluster or supercomputer access, has slowed the appli- 
cation of exact-exchange DFT calculations to practically 
sized structures. We have directly shown the feasibility of 
exact-exchange on moderately complex structures (hun- 
dreds of ions), and project that multi-node GPU clus- 
ters could be used to apply exact-exchange to even larger 
structures. 

GPU acceleration could have positive effects on general 
DFT as well. As mentioned previously and shown in Eck 
et al. Q , the methods used in this port are transferable to 
more conventional VASP use-cases. VASP scales poorly 
across large numbers of nodes in large computer systems, 
so it can be optimal to treat large complex structures us- 
ing a small number of 'fat' nodes. Previously, this meant 
spending lots of money on high-end chips, which would 
provide at most a factor of two in performance improve- 
ment (e.g. by going to high frequencies or wider SIMD 
instructions). Now, nodes can be fattened with GPUs, 
providing order-of-magnitude improvements at low cost. 
Since the communication overhead for a GPU and CPU 
are similar, it is reasonable to expect the largest CPU- 
based system to scale beyond the largest CPU-based sys- 
tems by the same factor that a single GPU outperforms 
a single CPU. Based on our analysis, that means a factor 
of 2 or three increase in structure size or, in the case of 
molecular dynamics, an order of magnitude increase in 
achievable simulation times. 
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